Introduction

SBA - Small Business Profiles for the States and Territories

The Office of Advocacy’s Small Business Profiles are an annual analysis of each state’s small business activities. Each profile gathers the latest information from key federal data-gathering agencies to provide a snapshot of small business health and economic activity. This year’s profiles report on state economic growth and employment; small business employment, industry composition, and turnover; plus business owner demographics and county-level employment change.

https://www.sba.gov/

In [1]:
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame
In [2]:
import pandas as pd
import multiprocessing
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial
import math

# Handle s3 or local
import s3fs
from os import listdir
from os.path import isfile, join
import subprocess

Dataset

This dataset from the U.S. Small Business Administration (SBA) can be downloaded from this website:

https://www.sba.gov/advocacy/small-business-profiles-states-and-territories-2016

Experiment:

Assess the pros and cons of the most popular libraries for reading PDFs.

Path to the files

In [7]:
import Tools
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-30cf977aff11> in <module>()
----> 1 import Tools

ImportError: No module named Tools

Files description

In [4]:
def list_files(path,ext = 'pdf'):
    # Return the names (without extension) of the files in `path` that end with `.ext`.
    if path.startswith('s3://'):  
        # List the S3 prefix with the AWS CLI; the file name is the last field of each line.
        onlyfiles = subprocess.check_output(['aws', 's3', 'ls', path])
        onlyfiles = onlyfiles.split('\n')
        onlyfiles = [f.split(" ")[-1] for f in onlyfiles]
    else:
        onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
    onlyfiles = [f for f in onlyfiles if f.endswith('.{}'.format(ext))]
    files = [f.replace('.{}'.format(ext),'') for f in onlyfiles]
    return files
In [5]:
def path(path,name,ext = 'pdf'):
    path_file = '{}{}.{}'.format(path,name,ext)
    return path_file
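
Note that `path` concatenates the folder and the file name directly, so the folder has to end with a separator already. A minimal sketch of a more forgiving variant follows; `build_path` and the bucket name are hypothetical and only serve as an illustration.

```python
import posixpath

def build_path(folder, name, ext='pdf'):
    """Join folder and file name with exactly one '/' between them."""
    return posixpath.join(folder, '{}.{}'.format(name, ext))

# Hypothetical bucket; both calls produce the same path.
without_slash = build_path('s3://bucket/pdf', 'Alabama')
with_slash = build_path('s3://bucket/pdf/', 'Alabama')
print(without_slash)
print(with_slash)
```

`posixpath` is used instead of `os.path` so that the separator is always `/`, which is what S3 URLs expect.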

The pdfs

Loading the file with PyPDF2

In [ ]:
import PyPDF2 # import PdfFileReader
In [224]:
def load_pdf(path_file):
    
    def get_content(fp_in):
        content = []
        pdf = PyPDF2.PdfFileReader(fp_in)
        number_of_pages = pdf.getNumPages()
        for i in xrange(number_of_pages):
            page = pdf.getPage(i).extractText().split()
            content.append(page)
        return content
    
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            content = get_content(fp_in)

    else:
        # Use a context manager so the local file is closed after reading.
        with open(path_file, 'rb') as fp_in:
            content = get_content(fp_in)

    return content
In [225]:
%%time
files = list_files(path_local)[1]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)
/experiments/_home/racostap/Alabama.pdf
CPU times: user 232 ms, sys: 12 ms, total: 244 ms
Wall time: 234 ms
In [226]:
for fp in file_pdf:
    print fp 
    print '\n'
[u'AlabamaSmallBusiness,2016', u'5', u'SBAofAdvocacy', u'ALABAMA', u'382,524', u'SmallBusinesses', u'765,293', u'SmallBusinessEmployees', u'96.7%', u'ofAlabamaBusinesses', u'47.7%', u'ofAlabamaEmployees', u'EMPLOYMENT', u'5,734', u'netnewjobs', u'1', u'DIVERSITY', u'30.7%', u'increaseinminority', u'ownership', u'2', u'TRADE', u'81.2%', u'ofAlabamaexporters', u'3', u'O', u'VERALL', u'A', u'LABAMA', u'E', u'CONOMY', u'\u0141', u'Inthethirdquarterof2015,Alabamagrewatanannualrateof', u'2.2%', u'whichwasfasterthantheoverallUSgrowthrateof', u'1.9%', u".Bycomparison,Alabama's2014growthof", u'3.6%', u'wasupfromthe2013levelof', u'3.1%', u'.(Source:', u'BEA', u')', u'\u0141', u'Atthecloseof2015,unemploymentwas', u'6.3%', u',upfrom', u'6.1%', u'atthecloseof2014.Thiswasabovethenationalunem-', u'ploymentrateof', u'5.0%', u'.(Source:', u'CPS', u')', u'E', u'MPLOYMENT', u'\u0141', u'Alabamasmallbusinessesemployed', u'765,293', u'people,or', u'47.7%', u'oftheprivateworkforce,in2013.(Source:', u'SUSB', u')', u'\u0141', u'Firmswithfewerthan100employeeshavethelargestshare', u'ofsmallbusinessemployment.SeeFigure1forfurtherde-', u'tailsonmswithemployees.(Source:', u'SUSB', u')', u'\u0141', u'Private-sectoremploymentincreased', u'1.3%', u'in2015.This', u"wasbelowthepreviousyear'sincreaseof", u'1.7%', u'.(Source:', u'CES', u')', u'\u0141', u'Thenumberofproprietorsincreasedin2014by', u'1.4%', u'rela-', u'tivetothepreviousyear.(Source:', u'BEA', u')', u'\u0141', u'Smallbusinessescreated', u'5,734', u'netjobsin2013.Among', u'thesevenBDSsize-classes,msemploying50to99em-', u'ployeesexperiencedthelargestgains,adding', u'3,417', u'net', u'jobs.Thelargestlosseswereinmsemploying250to499', u'employeeswhichlost', u'1,016', u'netjobs.(Source:', u'BDS', u')', u'Figure1:', u'AlabamaEmploymentbyFirmSize', u'17.0%', u'16.5%', u'14.3%', u'52.3%', u'1-19Employees', u'20-99Employees', u'100-499Employees', u'>500Employees', u'2013', u'0.0', u'500.0K', u'1.0M', u'1.5M', u'2000', u'2010', u'[', 
u"TheSmallBusinessareproducedbytheUSSmallBusinessAdministration'sofAdvocacy.Eachreportincorporatesthemostup-", u'to-dategovernmentdatatopresentauniquesnapshotofsmallbusinesses.', u'Smallbusinessesareasemployingfewerthan500', u'employees', u'.HyperlinkstodatasourcesandreportgenerationinformationareprovidedinTable3.', u'1,3', u'Netsmallbusinessjobschangeandexportersharearebasedonnewlyreleased2013BDSand2012ITAdata.', u'2', u'Diversitystatistictrackschangesbetween2007and2012basedontheSurveyofBusinessOwners(SBO)2015release.']


[u'AlabamaSmallBusiness,2016', u'6', u'SBAofAdvocacy', u'I', u'NCOMEAND', u'F', u'INANCE', u'\u0141', u'ThenumberofbanksreportedintheCallReportsbetweenJune2014andJune2015declined.(Source:', u'FDIC', u')', u'\u0141', u'In2014,', u'53,528', u'loansunder$100,000(andvaluedat', u'$', u'887.3million', u')wereissuedbyAlabamalendinginstitutionsreporting', u'undertheCommunityReinvestmentAct.(Source:', u'FFIEC', u')', u'\u0141', u'Themedianincome', u'4', u'forindividualswhowereself-employedattheirownincorporatedbusinesseswas', u'$', u'48,900', u'in2014.', u'Forindividualsself-employedattheirownunincorporatedms,thiswas', u'$', u'20,463', u'.(Source:', u'ACS', u')', u'[', u'4', u'Medianincomerepresentsearningsfromallsources.Unincorporatedself-employmentincomeincludesunpaidfamilyworkers,averysmall', u'percentoftheunincorporatedself-employed.', u'B', u'USINESS', u'O', u'WNER', u'D', u'EMOGRAPHICS', u'Figure2:', u'AlabamaChangesinBusiness', u'OwnershipbyDemographicGroup', u'AfricanAmerican-owned', u'28.7%', u'Asian-owned', u'35.4%', u'Hawaiian/PIslander-owned', u'-16.9%', u'Hispanic-owned', u'51.5%', u'NativeAmerican/Alaskan-owned', u'27.0%', u'Minority-owned', u'30.7%', u'Nonminority-owned', u'-8.6%', u'Figure3:', u'AlabamaSelf-Employmentwithin', u'DemographicGroup', u'5.7%', u'10.6%', u'5.2%', u'10.0%', u'Female', u'Male', u'Minority', u'Veteran', u'\u0141', u'Figure2displaysthechangeinoverallmownershipforeachdemographicgroupfrom2007to2012basedonthe', u'SurveyofBusinessOwners(', u'SBO', u')forAlabama,releasedinDecember2015.', u'\u0141', u'Figure3displaysthepercentofeachdemographicgroupasself-employedaccordingtothe2014American', u'CommunitySurvey(', u'ACS', u')5-yearestimates.', u'B', u'USINESS', u'T', u'URNOVER', u'\u0141', u'Inthesecondquarterof2014,', u'2,270', u'establishments', u'startedup', u'5', u'inAlabamaand', u'2,376', u'exited.', u'6', u'Startupsgen-', u'erated', u'9,675', u'newjobswhileexitscaused', u'8,698', u'joblosses.', u'(Source:', u'BDM', u')', u'\u0141', 
u'Figure4displaysstartupandexitratesfrom2005to2015.', u'Eachseriesissmoothedacrossmultiplequarterstohigh-', u'lightlong-runtrends.(Source:', u'BDM', u')', u'[', u'5', u'STARTUPS', u'arecountedwhenbusinessestablishmentshireatleast', u'oneemployeeforthetime.TheBLStermsthese', u'births', u',asdistinct', u'fromtheBLS', u'openings', u'categorywhichincludesseasonalre-openings.', u'6', u'EXITS', u'occurwhenestablishmentsgofromhavingatleastoneem-', u'ployeetohavingnone,andthenremainclosedforatleastayear.The', u'BLStermstheseevents', u'deaths', u',asdistinctfromthe', u'closings', u'category', u'whichincludesseasonalshutterings.', u'Figure4:', u'AlabamaPrivateStartupandExit', u'Rates', u'2.3%', u'2.4%', u'2.5%', u'2.6%', u'2.7%', u'2006', u'2009', u'2012', u'2015', u'exitrate', u'startuprate']


[u'AlabamaSmallBusiness,2016', u'7', u'SBAofAdvocacy', u'I', u'NTERNATIONAL', u'T', u'RADE', u'\u0141', u'Atotalof', u'3,964', u'companiesexportedgoodsfromAlabamain2013.Amongthese,', u'3,218', u',or', u'81.2%', u',weresmallms;they', u'generated', u'15.8%', u"ofAlabama'stotalknownexportvalue.(Source:", u'ITA', u')', u'S', u'MALL', u'B', u'USINESSESBY', u'I', u'NDUSTRY', u'Table1:', u'AlabamaSmallFirmsbyIndustry,2013', u'(sortedbysmallemployerms)', u'Industry', u'1\u0152499', u'Employees', u'1\u015219', u'Employees', u'Nonemployer', u'Firms', u'TotalSmall', u'Firms', u'RetailTrade', u'10,674', u'9,627', u'27,992', u'38,666', u'OtherServices(exceptPublicAdministration)', u'10,042', u'9,332', u'63,575', u'73,617', u'Professional,,andTechnicalServices', u'8,081', u'7,378', u'31,099', u'39,180', u'HealthCareandSocialAssistance', u'7,823', u'6,670', u'21,808', u'29,631', u'Construction', u'7,143', u'6,373', u'39,463', u'46,606', u'AccommodationandFoodServices', u'5,525', u'4,255', u'4,889', u'10,414', u'WholesaleTrade', u'3,785', u'2,974', u'5,061', u'8,846', u'Manufacturing', u'3,377', u'2,349', u'4,425', u'7,802', u'Administrative,Support,andWasteManagement', u'3,355', u'2,842', u'37,265', u'40,620', u'FinanceandInsurance', u'2,916', u'2,582', u'7,842', u'10,758', u'RealEstateandRentalandLeasing', u'2,799', u'2,590', u'29,081', u'31,880', u'TransportationandWarehousing', u'2,197', u'1,834', u'12,669', u'14,866', u'Arts,Entertainment,andRecreation', u'1,003', u'860', u'11,253', u'12,256', u'Agriculture,Forestry,FishingandHunting', u'768', u'715', u'4,378', u'5,146', u'EducationalServices', u'746', u'574', u'6,894', u'7,640', u'Information', u'617', u'489', u'2,930', u'3,547', u'Mining,Quarrying,andOilandGasExtraction', u'149', u'103', u'698', u'847', u'Utilities', u'92', u'64', u'256', u'348', u'Total', u'71,092', u'61,611', u'311,578', u'382,670', u'[', 
u"TotalsforTables1and2differfromSUSB'sstatewidetalliesduetomswithestablishmentsinmorethanoneindustryandtheomissionofindustry", u'notreportedbyNES.(Source:NESandSUSB)', u'sIndicatessamplesdeemedtoosmalltorepresentthepopulationaccordingtoSUSB.']


[u'AlabamaSmallBusiness,2016', u'8', u'SBAofAdvocacy', u'S', u'MALL', u'B', u'USINESS', u'E', u'MPLOYMENTBY', u'I', u'NDUSTRY', u'Table2:', u'AlabamaEmploymentbyIndustryandFirmSize,2013', u'(sortedbysmallmemployment)', u'Industry', u'SmallBusiness', u'Employment', u'TotalPrivate', u'Employment', u'SmallBusiness', u'EmploymentShare', u'HealthCareandSocialAssistance', u'113,580', u'240,549', u'47.2%', u'AccommodationandFoodServices', u'89,707', u'161,421', u'55.6%', u'RetailTrade', u'87,257', u'222,277', u'39.3%', u'Manufacturing', u'79,632', u'242,093', u'32.9%', u'OtherServices(exceptPublicAdministration)', u'68,770', u'80,073', u'85.9%', u'Construction', u'65,147', u'78,318', u'83.2%', u'Professional,,andTechnicalServices', u'57,856', u'92,520', u'62.5%', u'Administrative,Support,andWasteManagement', u'44,577', u'133,720', u'33.3%', u'WholesaleTrade', u'44,232', u'72,175', u'61.3%', u'FinanceandInsurance', u'24,832', u'69,332', u'35.8%', u'TransportationandWarehousing', u'24,484', u'58,471', u'41.9%', u'RealEstateandRentalandLeasing', u'15,577', u'23,257', u'67.0%', u'EducationalServices', u'13,791', u'28,969', u'47.6%', u'Arts,Entertainment,andRecreation', u'11,858', u'17,165', u'69.1%', u'Information', u'9,854', u'34,447', u'28.6%', u'Agriculture,Forestry,FishingandHunting', u'5,622', u'6,356', u'88.5%', u'Mining,Quarrying,andOilandGasExtraction', u'2,650', u'7,942', u'33.4%', u'Utilities', u'2,094', u'17,238', u'12.1%', u'Total', u'761,520', u'1,586,323', u'48.0%', u'Figure5:', u'AlabamaCounty-LevelJobChanges,', u'2015(CEW)', u'Table3:', u'AbbreviationsandResources', u'ACS', u'AmericanCommunitySurvey,USCensusBureau', u'BEA', u'BureauofEconomicAnalysis', u'BDM', u'BusinessEmploymentDynamics,BLS', u'BDS', u'BusinessDynamicsStatistics,USCensusBureau', u'BLS', u'BureauofLaborStatistics,USDepartmentofLabor', u'CES', u'CurrentEmploymentStatistics,BLS', u'CEW', u'CensusofEmploymentandWages,BLS', u'CPS', u'CurrentPopulationSurvey,BLS', u'FDIC', 
u'FederalDepositInsuranceCorporation', u'FFIEC', u'FederalFinancialInstitutionsExaminationCouncil', u'ITA', u'InternationalTradeAdministration', u'NES', u'NonemployerStatistics,USCensusBureau', u'SBO', u'SurveyofBusinessOwners,USCensusBureau', u'SUSB', u'StatisticsofUSBusinesses,USCensusBureau', u'All,sourcedata,methodologynotes,andcounty-level', u'employmentstatisticsareavailableat', u'http://go.usa.gov/cfKMd']


In [205]:
file_pdf[2]
Out[205]:
[u'AlabamaSmallBusiness,2016',
 u'7',
 u'SBAofAdvocacy',
 u'I',
 u'NTERNATIONAL',
 u'T',
 u'RADE',
 u'\u0141',
 u'Atotalof',
 u'3,964',
 u'companiesexportedgoodsfromAlabamain2013.Amongthese,',
 u'3,218',
 u',or',
 u'81.2%',
 u',weresmallms;they',
 u'generated',
 u'15.8%',
 u"ofAlabama'stotalknownexportvalue.(Source:",
 u'ITA',
 u')',
 u'S',
 u'MALL',
 u'B',
 u'USINESSESBY',
 u'I',
 u'NDUSTRY',
 u'Table1:',
 u'AlabamaSmallFirmsbyIndustry,2013',
 u'(sortedbysmallemployerms)',
 u'Industry',
 u'1\u0152499',
 u'Employees',
 u'1\u015219',
 u'Employees',
 u'Nonemployer',
 u'Firms',
 u'TotalSmall',
 u'Firms',
 u'RetailTrade',
 u'10,674',
 u'9,627',
 u'27,992',
 u'38,666',
 u'OtherServices(exceptPublicAdministration)',
 u'10,042',
 u'9,332',
 u'63,575',
 u'73,617',
 u'Professional,,andTechnicalServices',
 u'8,081',
 u'7,378',
 u'31,099',
 u'39,180',
 u'HealthCareandSocialAssistance',
 u'7,823',
 u'6,670',
 u'21,808',
 u'29,631',
 u'Construction',
 u'7,143',
 u'6,373',
 u'39,463',
 u'46,606',
 u'AccommodationandFoodServices',
 u'5,525',
 u'4,255',
 u'4,889',
 u'10,414',
 u'WholesaleTrade',
 u'3,785',
 u'2,974',
 u'5,061',
 u'8,846',
 u'Manufacturing',
 u'3,377',
 u'2,349',
 u'4,425',
 u'7,802',
 u'Administrative,Support,andWasteManagement',
 u'3,355',
 u'2,842',
 u'37,265',
 u'40,620',
 u'FinanceandInsurance',
 u'2,916',
 u'2,582',
 u'7,842',
 u'10,758',
 u'RealEstateandRentalandLeasing',
 u'2,799',
 u'2,590',
 u'29,081',
 u'31,880',
 u'TransportationandWarehousing',
 u'2,197',
 u'1,834',
 u'12,669',
 u'14,866',
 u'Arts,Entertainment,andRecreation',
 u'1,003',
 u'860',
 u'11,253',
 u'12,256',
 u'Agriculture,Forestry,FishingandHunting',
 u'768',
 u'715',
 u'4,378',
 u'5,146',
 u'EducationalServices',
 u'746',
 u'574',
 u'6,894',
 u'7,640',
 u'Information',
 u'617',
 u'489',
 u'2,930',
 u'3,547',
 u'Mining,Quarrying,andOilandGasExtraction',
 u'149',
 u'103',
 u'698',
 u'847',
 u'Utilities',
 u'92',
 u'64',
 u'256',
 u'348',
 u'Total',
 u'71,092',
 u'61,611',
 u'311,578',
 u'382,670',
 u'[',
 u"TotalsforTables1and2differfromSUSB'sstatewidetalliesduetomswithestablishmentsinmorethanoneindustryandtheomissionofindustry",
 u'notreportedbyNES.(Source:NESandSUSB)',
 u'sIndicatessamplesdeemedtoosmalltorepresentthepopulationaccordingtoSUSB.']
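
Even though the page comes back as one flat list, the table part can be regrouped into rows. A minimal sketch, assuming every label is a single token followed by exactly four values (which holds here only because the spaces inside the labels were lost); `rows_from_tokens` is a hypothetical helper.

```python
def rows_from_tokens(tokens, n_values=4):
    """Group a flat token list into [label, value, ...] rows of n_values numbers each."""
    step = n_values + 1
    return [tokens[i:i + step] for i in range(0, len(tokens) - step + 1, step)]

# Tokens copied from the page above, starting right after the column headers.
tokens = [u'RetailTrade', u'10,674', u'9,627', u'27,992', u'38,666',
          u'OtherServices(exceptPublicAdministration)', u'10,042', u'9,332', u'63,575', u'73,617']

rows = rows_from_tokens(tokens)
for row in rows:
    print(row)
```

The start and end of the table still have to be located by hand, for example by searching for the `Industry` and `Total` tokens.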

Loading the file with Tabula

In [212]:
import tabula
In [213]:
def load_pdf(path_file):
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            pdf = tabula.read_pdf(fp_in,multiple_tables=True)
    else:
        pdf = tabula.read_pdf(path_file,multiple_tables=True)
    return pdf
In [214]:
# tabula.read_pdf(file_path,multiple_tables=True, pages = 3)
In [216]:
%%time
files = list_files(path_local)[1]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)
/experiments/_home/racostap/Alabama.pdf
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 2.53 s
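
The gap between CPU time (8 ms) and wall time (2.53 s) is most likely because tabula-py hands the work to tabula-java in a separate Java process, so the notebook process mostly waits. The sketch below reproduces that pattern with the standard library only; `wall_and_cpu` is a hypothetical helper, `time.sleep` stands in for the external process, and `time.process_time` requires Python 3.

```python
import time

def wall_and_cpu(func):
    """Return (wall seconds, CPU seconds of this process) spent in func()."""
    wall0, cpu0 = time.time(), time.process_time()
    func()
    return time.time() - wall0, time.process_time() - cpu0

# time.sleep stands in for waiting on an external process.
wall, cpu = wall_and_cpu(lambda: time.sleep(0.2))
print('wall %.2f s, cpu %.2f s' % (wall, cpu))
```

When comparing libraries, the wall time is therefore the number to look at.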
In [211]:
file_pdf
Out[211]:
[      0                                                  1        2  \
 0   NaN  of small business employment. See Figure 1 for...      NaN   
 1   NaN      tails on firms with employees. (Source: SUSB)    1.5 M   
 2     •  Private-sector employment increased 1.3% in 20...      NaN   
 3   NaN  was below the previous year’s increase of 1.7%...      NaN   
 4   NaN                                               CES)    1.0 M   
 5     •  The number of proprietors increased in 2014 by...      NaN   
 6   NaN           tive to the previous year. (Source: BEA)      NaN   
 7   NaN                                                NaN      NaN   
 8   NaN                                                NaN  500.0 K   
 9   NaN                                                NaN      NaN   
 10    •  Small businesses created 5,734 net jobs in 201...      NaN   
 11  NaN  the seven BDS size-classes, firms employing 50...      NaN   
 12  NaN  ployees experienced the largest gains, adding ...      NaN   
 
                     3      4  
 0                 NaN    NaN  
 1                 NaN    NaN  
 2      >500 Employees  52.3%  
 3                 NaN    NaN  
 4                 NaN    NaN  
 5   100-499 Employees    NaN  
 6                 NaN    NaN  
 7                 NaN  14.3%  
 8                 NaN    NaN  
 9     20-99 Employees    NaN  
 10                NaN  16.5%  
 11                NaN    NaN  
 12     1-19 Employees  17.0%  ,
      0                                                  1  \
 0  NaN                                                NaN   
 1  NaN                                         EMPLOYMENT   
 2  NaN                                              5,734   
 3  NaN                                     net new jobs 1   
 4  NaN                            OVERALL ALABAMA ECONOMY   
 5    •  In the third quarter of 2015, Alabama grew at ...   
 6  NaN  1.9%. By comparison, Alabama’s 2014 growth of ...   
 7    •  At the close of 2015, unemployment was 6.3%, u...   
 8  NaN               ployment rate of 5.0%. (Source: CPS)   
 9  NaN                                         EMPLOYMENT   
 
                                                    2    3  
 0                                          DIVERSITY  NaN  
 1                                              TRADE  NaN  
 2                                        30.7% 81.2%  NaN  
 3  increase in minorityownership2 of Alabama expo...    3  
 4                                                NaN  NaN  
 5                                                NaN  NaN  
 6                                                NaN  NaN  
 7                  This was above the national unem-  NaN  
 8                                                NaN  NaN  
 9                                                NaN  NaN  ]
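
Sentences that were hyphenated at the end of a line come back in pieces, for example `unem-` and `ployment rate of 5.0%`. A minimal sketch of a repair step, assuming the fragments have already been collected in reading order; `join_hyphenated` is a hypothetical helper.

```python
def join_hyphenated(lines):
    """Merge a line ending in '-' with the line that follows it."""
    merged = []
    for line in lines:
        if merged and merged[-1].endswith('-'):
            merged[-1] = merged[-1][:-1] + line
        else:
            merged.append(line)
    return merged

lines = ['This was above the national unem-', 'ployment rate of 5.0%. (Source: CPS)']
fixed = join_hyphenated(lines)
print(fixed)
```

This also removes genuine hyphens that happen to fall at the end of a line, so the result should be spot-checked.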

Loading the file with pdfquery

In [227]:
import pdfquery
In [11]:
def load_pdf(path_file):
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            pdf = pdfquery.PDFQuery(fp_in)
            pdf.load()
    else:
        pdf = pdfquery.PDFQuery(path_file)
        pdf.load()        
    return pdf
In [149]:
%%time
files = list_files(path_local)[1]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)

Finding some text and retrieving the coordinates

Report for Alabama

[Screenshot: Alabama report]

Report for Alaska

[Screenshot: Alaska report]

In [13]:
def getCoordinates(pdf,query, type_search = "Line"):
    # Yield the bounding box, text and page of every element whose text contains `query`.
    name = pdf.pq('LTText%sHorizontal:contains("%s")' % (type_search,query))
    for n in name:
        d = dict()
        # Round outward to three decimals so the box never shrinks.
        d["left_corner"] = math.floor(float(n.layout.x0)* 1000)/1000.0
        d["bottom_corner"] = math.floor(float(n.layout.y0)* 1000)/1000.0
        d["right_corner"] = math.ceil(float(n.layout.x1)* 1000)/1000.0
        d["upper_corner"] = math.ceil(float(n.layout.y1)* 1000)/1000.0
        d["text"] = n.layout.get_text()
        d["pageid"] = int(float(next(n.iterancestors('LTPage')).layout.pageid))
        yield d
In [14]:
g = getCoordinates(file_pdf,'Small Businesses', type_search='Line')
d = next(g,None)
d
Out[14]:
{'bottom_corner': 635.368,
 'left_corner': 103.344,
 'pageid': 1,
 'right_corner': 190.135,
 'text': u'Small Businesses\n',
 'upper_corner': 648.985}
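
The coordinates are rounded outward (floor for the left and bottom edges, ceil for the right and upper edges) so that the rounded box still contains the text. A small sketch of that rounding on its own; `round_outward` is a hypothetical helper and the raw coordinates are made up.

```python
import math

def round_outward(x0, y0, x1, y1, decimals=3):
    """Round a bounding box so it can only grow: floor the minimums, ceil the maximums."""
    scale = 10 ** decimals
    return (math.floor(x0 * scale) / float(scale),
            math.floor(y0 * scale) / float(scale),
            math.ceil(x1 * scale) / float(scale),
            math.ceil(y1 * scale) / float(scale))

# Hypothetical raw coordinates with more than three decimals.
box = round_outward(103.3446, 635.3689, 190.1342, 648.9841)
print(box)
```

Plain rounding could move an edge inward by up to half a unit of the last decimal, which may be enough for a bounding-box query to miss the element.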

Retrieving the text around a given set of coordinates

In [16]:
file_pdf.pq(('LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (d['pageid'],
                                                                                  d['left_corner'],
                                                                                  d['bottom_corner'],
                                                                                  d['right_corner'],
                                                                                  d['upper_corner']))).text()
Out[16]:
'Small Businesses\nof Alabama Businesses'
In [17]:
left_corner = 0
file_pdf.pq(('LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (d['pageid'],
                                                                                  left_corner,
                                                                                  d['bottom_corner'],
                                                                                  d['right_corner'],
                                                                                  d['upper_corner']))).text()
Out[17]:
'382,524\n96.7% Small Businesses\nof Alabama Businesses'
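
Setting `left_corner = 0` widens the query box to the left page margin, so text boxes to the left of the label are picked up as well. The sketch below illustrates the difference between "overlaps" and "fully inside" in plain Python. It is not pdfquery's implementation; the label box comes from `Out[14]` and the box of the number is hypothetical.

```python
def in_bbox(inner, outer):
    """True if box `inner` lies entirely inside box `outer`; boxes are (x0, y0, x1, y1)."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

def overlaps_bbox(a, b):
    """True if boxes `a` and `b` share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

label_box = (103.344, 635.368, 190.135, 648.985)   # from Out[14]
wide_box = (0, 635.368, 190.135, 648.985)          # same box with left_corner = 0
number_box = (40.0, 630.0, 95.0, 655.0)            # hypothetical box of the number

print(overlaps_bbox(number_box, label_box))  # the label box alone misses the number
print(overlaps_bbox(number_box, wide_box))   # the widened box reaches it
print(in_bbox(number_box, wide_box))         # but the number is taller than the box
```

An overlap test is the more tolerant choice when neighbouring text is slightly taller or wider than the reference element.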

Reading several fields all at once

In [18]:
KeyFigures = ['EMPLOYMENT',
              'DIVERSITY',
              'TRADE']    
delta_bottom = 30

Info = [('with_formatter', 'text')]

for kf in KeyFigures:
    g = getCoordinates(pdf=file_pdf,query=kf,type_search="Box")
    d = next(g,None)
    Info.append(tuple((kf,'LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")'%(d['pageid'],
                                                                                                   d["left_corner"],
                                                                                                   d["bottom_corner"]-delta_bottom,
                                                                                                   d["right_corner"],
                                                                                                   d["upper_corner"]))))
# Run the extraction once, after all the queries have been collected.
info = file_pdf.extract(Info)
info
Out[18]:
{'DIVERSITY': 'DIVERSITY 30.7% increase in minority ownership2',
 'EMPLOYMENT': 'EMPLOYMENT 5,734 net new jobs1',
 'TRADE': 'TRADE\n81.2% of Alabama exporters3'}
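
Each value still mixes the heading, the figure and the caption. A minimal sketch for pulling out the figure, assuming it is the first number in the string; `first_number` is a hypothetical helper.

```python
import re

# Either a number with thousands separators (5,734) or plain digits, with optional decimals and '%'.
NUMBER = re.compile(r'-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?')

def first_number(text):
    """Return the first numeric token in text, e.g. '5,734' or '30.7%', or None."""
    match = NUMBER.search(text)
    return match.group(0) if match else None

info = {'DIVERSITY': 'DIVERSITY 30.7% increase in minority ownership2',
        'EMPLOYMENT': 'EMPLOYMENT 5,734 net new jobs1',
        'TRADE': 'TRADE\n81.2% of Alabama exporters3'}

figures = {key: first_number(value) for key, value in info.items()}
print(figures)
```

Taking the first match also skips the footnote digit that is glued to the end of each caption.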

A better example

In [19]:
def info1(file_pdf):
    col_right_align = 300
    DemographicGroup = ['American-owned',
                        'Asian-owned',
                        'Islander-owned',
                        'Hispanic-owned',
                        'Alaskan-owned',
                        'Minority-owned',
                        'Nonminority-owned']    
    
    DemographicInfo = [('with_formatter', 'text')]
    
    for dg in DemographicGroup:
        g = getCoordinates(pdf=file_pdf,query=dg,type_search="Line")
        d = next(g,None)
        DemographicInfo.append(tuple((dg,'LTTextLineHorizontal:in_bbox("%f,%f,%f,%f")'%(d["left_corner"],
                                                                                        d["bottom_corner"],
                                                                                        col_right_align,
                                                                                        d["upper_corner"]))))
    info = file_pdf.extract(DemographicInfo)
    return info
In [20]:
info1(file_pdf)
Out[20]:
{'Alaskan-owned': 'Native American/Alaskan-owned l 27.0%',
 'American-owned': 'African American-owned l 28.7%',
 'Asian-owned': 'Asian-owned l 35.4%',
 'Hispanic-owned': 'Hispanic-owned l 51.5%',
 'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%',
 'Minority-owned': 'Minority-owned l 30.7%',
 'Nonminority-owned': 'Nonminority-owned l -8.6%'}
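
The extracted strings contain the full label, a stray `l` token and the percentage. A minimal sketch that turns one of them into a label and a number; `parse_change` is a hypothetical helper, and `\ufb01` is the "fi" ligature that appears in "Pacific".

```python
def parse_change(text):
    """Split 'Label l 28.7%' into ('Label', 28.7); the stray 'l' token is dropped."""
    label, value = text.rsplit(' ', 1)
    if label.endswith(' l'):
        label = label[:-2]
    return label.replace(u'\ufb01', 'fi'), float(value.rstrip('%'))

print(parse_change('African American-owned l 28.7%'))
print(parse_change(u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%'))
```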

How about a full table?

In [21]:
def getTable(file_pdf, col_width, row_space, row_height,title,bottom_corner_dif,headers,col_left_align):
    
    table = list()
    table.append(headers)
    
    g = getCoordinates(pdf=file_pdf,query=title,type_search="Line")
    d = next(g,None)
    
    pageid = d['pageid']
    bottom_corner = d['bottom_corner'] - bottom_corner_dif

    while True:
        # One query per column: a box col_width wide and row_height tall at the current row.
        boxes = list()
        for c in xrange(len(headers)):
            boxes.append(tuple(('col_%s' %(c),
                               'LTPage[pageid="%s"] LTTextLineHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (pageid,
                                                                                                          col_left_align[c],
                                                                                                          bottom_corner,
                                                                                                          col_left_align[c]+col_width,
                                                                                                          bottom_corner+row_height))))

        row = file_pdf.extract(boxes)
        columns = [row['col_{}'.format(c)].text() for c in xrange(len(headers))]
        table.append(columns)
        # The last row of the table starts with 'Total'.
        if 'Total' in row['col_0'].text():
            break

        # Move down to the next row.
        bottom_corner -= row_space
    return table
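
`getTable` walks down the page one row at a time: each row is read from a box `row_height` tall, and the box is then moved down by `row_space`. The sketch below shows only that arithmetic; `row_boxes` is a hypothetical helper, the starting y coordinate is made up, and the spacing values are the ones used for Table 1.

```python
def row_boxes(first_bottom, row_space, row_height, n_rows):
    """Return (bottom, top) y-ranges for n_rows rows, stepping downward by row_space."""
    boxes = []
    bottom = first_bottom
    for _ in range(n_rows):
        boxes.append((bottom, bottom + row_height))
        bottom -= row_space
    return boxes

# Hypothetical starting y; row_space and row_height as used for Table 1.
boxes = row_boxes(500.0, row_space=16.78, row_height=14, n_rows=3)
for bottom, top in boxes:
    print('%.2f to %.2f' % (bottom, top))
```

The gap between two consecutive boxes is `row_space - row_height`. If the real line pitch of the PDF differs slightly from `row_space`, the error accumulates with every row, so long tables need these two values tuned carefully.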
In [22]:
def info2(file_pdf):
    col_width = 35
    col_left_align = [50,295,371,449,532]
    row_space = 16.78
    row_height = 14
    bottom_corner_dif = 126.91
    headers = ['Industry',
                '1-499 Employees',
                '1-19 Employees',
                'Nonemployer Firms',
                'Total Small Firms'] 

    table = getTable(col_left_align=col_left_align,
                     col_width=col_width,
                     file_pdf=file_pdf,
                     headers=headers,
                     row_height=row_height,
                     row_space = row_space,
                     bottom_corner_dif=bottom_corner_dif,
                     title = "Table 1")
                     
    return table
In [23]:
info2(file_pdf)
Out[23]:
[['Industry',
  '1-499 Employees',
  '1-19 Employees',
  'Nonemployer Firms',
  'Total Small Firms'],
 ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
 ['Other Services (except Public Administration)',
  '10,042',
  '9,332',
  '63,575',
  '73,617'],
 [u'Professional, Scienti\ufb01c, and Technical Services',
  '8,081',
  '7,378',
  '31,099',
  '39,180'],
 ['Health Care and Social Assistance', '7,823', '6,670', '21,808', '29,631'],
 ['Construction', '7,143', '6,373', '39,463', '46,606'],
 ['Accommodation and Food Services', '5,525', '4,255', '4,889', '10,414'],
 ['Wholesale Trade', '3,785', '2,974', '5,061', '8,846'],
 ['Manufacturing', '3,377', '2,349', '4,425', '7,802'],
 ['Administrative, Support, and Waste Management',
  '3,355',
  '2,842',
  '37,265',
  '40,620'],
 ['Finance and Insurance', '2,916', '2,582', '7,842', '10,758'],
 ['Real Estate and Rental and Leasing', '2,799', '2,590', '29,081', '31,880'],
 ['Transportation and Warehousing', '2,197', '1,834', '12,669', '14,866'],
 ['Arts, Entertainment, and Recreation', '1,003', '860', '11,253', '12,256'],
 ['Agriculture, Forestry, Fishing and Hunting',
  '768',
  '715',
  '4,378',
  '5,146'],
 ['Educational Services', '746', '574', '6,894', '7,640'],
 ['Information', '617', '489', '2,930', '3,547'],
 ['Mining, Quarrying, and Oil and Gas Extraction', '149', '103', '698', '847'],
 ['Utilities', '92', '64', '256', '348'],
 ['Total', '71,092', '61,611', '311,578', '382,670']]
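
Once the table is a list of lists, it is easy to convert the figures to integers and run a sanity check. A minimal sketch on three of the rows above; `to_int` is a hypothetical helper. In this table the "1-499 Employees" and "Nonemployer Firms" columns add up to "Total Small Firms", which makes a convenient check that no cell was misread.

```python
def to_int(text):
    """Convert '10,674' to 10674."""
    return int(text.replace(',', ''))

table = [['Industry', '1-499 Employees', '1-19 Employees', 'Nonemployer Firms', 'Total Small Firms'],
         ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
         ['Utilities', '92', '64', '256', '348'],
         ['Total', '71,092', '61,611', '311,578', '382,670']]

headers, rows = table[0], table[1:]
records = [dict(zip(headers, [r[0]] + [to_int(v) for v in r[1:]])) for r in rows]

# Rows where employer firms plus nonemployer firms do not match the total.
mismatches = [rec['Industry'] for rec in records
              if rec['1-499 Employees'] + rec['Nonemployer Firms'] != rec['Total Small Firms']]
print(mismatches)
```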

Another example

In [25]:
def info3(file_pdf):
    col_width = 35
    col_left_align = [50,325,400,532]
    row_space = 13.5
    row_height = 12.4
    bottom_corner_dif = 115.5

    headers = ['Industry',
               'Small Business Employment',
               'Total Private Employment',
               'Small Business Emp Share']    
    
    table = getTable(col_left_align=col_left_align,
                     col_width=col_width,
                     file_pdf=file_pdf,
                     headers=headers,
                     row_height=row_height,
                     row_space = row_space,
                     bottom_corner_dif=bottom_corner_dif,
                     title = "Table 2"
     )

    return table
In [26]:
info3(file_pdf)
Out[26]:
[['Industry',
  'Small Business Employment',
  'Total Private Employment',
  'Small Business Emp Share'],
 ['Health Care and Social Assistance', '113,580', '240,549', '47.2%'],
 ['Accommodation and Food Services', '89,707', '161,421', '55.6%'],
 ['Retail Trade', '87,257', '222,277', '39.3%'],
 ['Manufacturing', '79,632', '242,093', '32.9%'],
 ['Other Services (except Public Administration)',
  '68,770',
  '80,073',
  '85.9%'],
 ['Construction', '65,147', '78,318', '83.2%'],
 [u'Professional, Scienti\ufb01c, and Technical Services',
  '57,856',
  '92,520',
  '62.5%'],
 ['Administrative, Support, and Waste Management',
  '44,577',
  '133,720',
  '33.3%'],
 ['Wholesale Trade', '44,232', '72,175', '61.3%'],
 ['Finance and Insurance', '24,832', '69,332', '35.8%'],
 ['Transportation and Warehousing', '24,484', '58,471', '41.9%'],
 ['Real Estate and Rental and Leasing', '15,577', '23,257', '67.0%'],
 ['Educational Services', '13,791', '28,969', '47.6%'],
 ['Arts, Entertainment, and Recreation', '11,858', '17,165', '69.1%'],
 ['Information', '9,854', '34,447', '28.6%'],
 ['Agriculture, Forestry, Fishing and Hunting', '5,622', '6,356', '88.5%'],
 ['Mining, Quarrying, and Oil and Gas Extraction', '2,650', '7,942', '33.4%'],
 ['Utilities', '2,094', '17,238', '12.1%'],
 ['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%']]
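
In the output above, the last row picked up two lines of the PDF at once: the "Utilities" row and the "Total" row. A heuristic repair, assuming a merged row is one in which every cell contains at least two whitespace-separated tokens, could look like this; `split_merged_row` is a hypothetical helper.

```python
def split_merged_row(row):
    """If a row picked up two lines, keep only the last token of each cell."""
    if all(len(cell.split()) >= 2 for cell in row):
        return [cell.split()[-1] for cell in row]
    return row

last = ['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%']
repaired = split_merged_row(last)
print(repaired)
```

This only patches the symptom. Adjusting `row_space` and `row_height` so that each box covers a single line is the cleaner fix.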

How about several PDFs at the same time?

In [27]:
def process_file(path_file):
    file_pdf = load_pdf(path_file)
    d = dict()
    d['file'] = path_file
    d.update(info1(file_pdf))
    x = info2(file_pdf)
    d['industry'] = x
    x = info3(file_pdf)
    d['employment'] = x
    return d
In [28]:
# https://stackoverflow.com/questions/29494001/how-can-i-abort-a-task-in-a-multiprocessing-pool-after-a-timeout
def abortable_worker(func, *args, **kwargs):
    timeout = kwargs.get('timeout', None)
    p = ThreadPool(1)
    res = p.apply_async(func, args=args)
    try:
        out = res.get(timeout)  # Wait timeout seconds for func to complete.
        return out
    except multiprocessing.TimeoutError:
        print("Aborting due to timeout ")
        p.terminate()
        raise
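
The timeout pattern can be tried in isolation, without any PDF. A minimal sketch using the same thread-pool idea; `run_with_timeout` and `slow_square` are hypothetical names, and unlike `abortable_worker` this version returns `None` on timeout instead of re-raising.

```python
import multiprocessing
import time
from multiprocessing.dummy import Pool as ThreadPool

def run_with_timeout(func, timeout, *args):
    """Run func(*args) in a one-thread pool; return None if it takes longer than timeout seconds."""
    pool = ThreadPool(1)
    try:
        return pool.apply_async(func, args=args).get(timeout)
    except multiprocessing.TimeoutError:
        return None
    finally:
        pool.terminate()

def slow_square(x):
    time.sleep(0.5)
    return x * x

fast_result = run_with_timeout(pow, 1.0, 2, 3)         # finishes in time
slow_result = run_with_timeout(slow_square, 0.05, 4)   # gives up after 0.05 s
print(fast_result, slow_result)
```

A thread cannot be killed from outside, so after a timeout the function keeps running in the background until it returns or its process ends.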
In [29]:
if __name__ == '__main__':    
    result = list()
    pool = multiprocessing.Pool(maxtasksperchild=1)
    files = list_files(s3)
    files = files[0:4]
    for i in files:
        abortable_func = partial(abortable_worker, process_file, timeout=60)
        path_file = path(s3,i)
        pool.apply_async(abortable_func, args=(path_file, ), callback=result.append)
    pool.close()
    pool.join()
s3://eh-home/ehda-calvin/SBA_study/pdf/Alaska.pdf
s3://eh-home/ehda-calvin/SBA_study/pdf/Alabama.pdf
s3://eh-home/ehda-calvin/SBA_study/pdf/American_Samoa.pdf
s3://eh-home/ehda-calvin/SBA_study/pdf/Arizona.pdf

Analysis

In [ ]:
## To do: difficulties, customization, files that cannot be read, and running on S3